-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(r): Implement dictionary conversion #285
Conversation
Codecov Report
@@ Coverage Diff @@
## main #285 +/- ##
==========================================
- Coverage 87.12% 86.93% -0.20%
==========================================
Files 66 63 -3
Lines 10246 9965 -281
==========================================
- Hits 8927 8663 -264
+ Misses 1319 1302 -17
... and 5 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
fa80a63
to
6898c6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very clean to me. I have only skimmed through the C changes, protection looks sane.
Co-authored-by: Kirill Müller <[email protected]>
Co-authored-by: Kirill Müller <[email protected]>
Co-authored-by: Kirill Müller <[email protected]>
Thank you for the review! |
This PR implements conversion from dictionary-encoded arrays to R vectors. To make this happen, it was also necessary to refactor a few things about the conversion process to make writing conversions in R easier. This is also with an eye towards extension type conversion (up next) where it is expected that extension types will mostly implement conversions as S3
convert_array()
method. Previously this didn't work if converting a stream with multiple batches...now it does (although will incur S3 dispatch overhead for each batch for conversions that are not handled in C).Conversion from a dictionary always defaults to the conversion that would happen from the dictionary value type. This is because in Arrow the dictionary is a property of the array, not the type (whereas in R, a factor's levels are a property of the type, sort of). For a dictionary-encoded array that arrives in chunks, the only type-stable way to perform the conversion is by resolving the dictionary into the value type. This also makes it a more predictable default for other dictionary-encoded types (none of which have an R analogue).
In practice this has low memory overhead because the most commonly dictionary-encoded type is a string, and because R has a global string pool; however, it is quite a bit slower than decoding directly to
factor()
.Like other conversions (and like
vctrs::vec_cast()
), you can also request a targetptype
. This is implemented forfactor()
. If converting an array stream containing dictionary-encoded values, thelevels
need to be explicitly specified.It's hard to benchmark this because ALTREP makes the effect of some conversions deferred, but it seems like the performance characteristics are vaguely similar to what happens in Arrow (probably Arrow is faster for many chunks):
Created on 2023-08-23 with reprex v2.0.2